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Abstract 

We focus the use of row sampling for approximating matrix algorithms. We give applications 
to matrix multipication; sparse matrix reconstruction; and, £2 regression. For a matrix A G 
jgmxd w hi cn represents m points in d <C m dimensions, all of these tasks can be achieved in 
0{md 2 ) via the singular value decomposition (SVD). For appropriate row-sampling probabilities 
(which typically depend on the norms of the rows of the m x d left singular matrix of A (the 
leverage scores), we give row-sampling algorithms with linear (up to poly log factors) dependence 
on the stable rank of A. This result is achieved through the application of non-commutative 
Bernstein bounds. 

We then give, to our knowledge, the first algorithms for computing approximations to the 
appropriate row-sampling probabilities without going through the SVD of A. Thus, these are 
the first o(md 2 ) algorithms for row-sampling based approximations to the matrix algorithms 
which use leverage scores as the sampling probabilities. The techniques we use to approximate 
sampling according to the leverage scores uses some powerful recent results in the theory of 
random projections for embedding, and may be of some independent interest. We confess that 
one may perform all these matrix tasks more efficiently using these same random projection 
methods, however the resulting algorithms are in terms of a small number of linear combinations 
of all the rows. In many applications, the actual rows of A have some physical meaning and so 
methods based on a small number of the actual rows are of interest. 



1 Introduction 



Matrix algorithms (eg. matrix multiplicat i on, S VD, £2 regression) are of wi despread use in 



Achlioptas et aZ.I( 2001 



2008 



application areas: dat a mining (lAzar et al\.\200l\): recommendations systems (IDrineas et al 



information retrieval (|Berry et all . Il995l: iPapadimitriou et all. 120001) : we b search iKleinbergj ( 1999) ; 



); clustering dPrineas et q/]j2004l : lMcSherrvlJ200l] ) ; mixture modeling (jKannan et all 



man y 



2002); 



Achlioptas and McSherryl . 120051 ): etc. Based on the importance of matrix algorithms, there 
has been consi derable research energy expe nded on breaking the 0(md 2 ) bound required by exact 
SVD methods dGolub and Van LoailflB. 



Starting with a seminal result of iFrieze et al. ( 19981 ) . a larg e number of results using non- 
uniform sampling to speed up matrix computations hav e appeared (lAchlioptas and McSherryl. 12 007; 



Deshpande et aUl2006l:lDeshpande and Vempalal . l2006l : lDrineas et al 



20071 : iMagen and Zouziasl . l20Tol ). some of which give relative error guarantees (|Deshpande et al 



2006a 



bed 



Rudclson and Vershynin 
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2006 : Deshpande and Vempala . 2006 ; Drineas et ai , 2006d i; Magen and Zouziasl 2010l ). So far 
none of the row-sampling based methods which use non-uniform probabilities generated according 
to the norms of the rows in the left singular matrix have yielded practically efficient algorithms, 
although their row sam pling complex ities (number of rows to be sampled) are impressive. 

Even more recently, Sarlos ( 20061 ) showed how random projections or "sketches" can be used 
to perform all these tasks efficiently, obtaining the first o(md 2 ) algorithms when preserving the 
identity of the rows the mselves are not important . In fact, we will find many of these techniques, 
together with those in lAilon and Chazellel (120061 ) essential to our algorithm for generating row 
samples ultimately leading to o(md 2 ) algorithms based on row-sampling. From now on, we focus 
on row-sampling algorithms. 

We start with the basic result of matri x multiplication. All other results more or less follow from 
here. In an independent recent work b ylMagen and Zouziasl (120101) wh ich is developed along the 
lines of using isoperimetric inequali ties ( Rudelson and Vershynin . 20071 ) to obtain matrix Chernoff 
bounds, Magen and Zouzias ( 2010l ) show that by sampling nearly a linear number of rows, it is 
possible to obtain a relative error approximation to matrix multiplication. Specifically, let A E 
M mxdi and B g r xd 2i Then for r > ( plog di+d2.y e 2 ( w h ere p bounds the stable (or "soft") rank 

of A, B - see later), there is a probability distribution over 1 = {1, . . . , m} such that by sampling 
r rows i.i.d. from I, one can construct sketches A, B such that A B ~ A T B. Specifically, with 
probability at least 1 — 6, 



A B 



A T B|| 2 < e || A 



B 



2" 



The sampling distribution is relatively simple, relying only on the product of the norms of the 
rows in A and B. This result is applied to low rank matrix reconstruction and ^-regression where 
the required sampling distribution needs knowledge of the SVD of A and B. It is not known 
how to sample efficiently from these so-called "expensive" probabilities, because to compute the 
probabilities requires going through the SVD which inflates the running times. 

Our basic result for matrix multiplication is very similar to this, and we arrive at it through a 
different path using a non-commutative Bernstein bound. Our sampling probabilities are different. 
In appication of our results to sparse matrix reconstruction and £2 -regression, the rows of the 
left singular matrix make an appearance. We show how to approximate these probabilities using 
random projections at the expense of a poly-logarithmic factor in running times. Specifically, if 
Ui, • • • > u m are the rows of the left singular vectors of a matrix A, we show how, in o(md 2 ), to 
approximate the "leverage scores" u^/d to within a poly-log factor. It turns out that such an 
approximation is sufficient to obtain all the algorithms with only a poly-log bloat in efficiency. Any 
improvement in the approximation of these probabilities directly translates to improvements in such 
row-sampling based algorithms. To our knowledge, this is the first attempt to bring row-sampling 
based algorithms to within the realm of o(md 2 ). To do so, we will use some powerful re s ults i n fast 
metric preserving embeddings ( Johnson and Lindenstrauss ( 19841 ): Ailon and Chazelle ( 20061 )). 



1.1 Basic Notation 

Before we can state the results in concrete form, we need some preliminary conventions. In general, 
e £ (0, 1) will be an error tolerance parameter; (3 € (0, 1] is a parameter used to scale probabilities; 
and, c, d > are generic constants whose value may vary even within different lines of the same 
derivation. Let ei, . . . , e m be the standard basis vectors in R m . Let A E ^ mxd denote an arbitrary 
matrix which represents m points in M. d . In general, we might represent a matrix such as A (roman, 
uppercase) by a set of vectors ai, ... ,a m G M. d (bold, lowercase), so that A T = [ai a2 ... a m ]; 
similarly, for a vector y, y T = [yi, . . . ,y m ]. Note that at is the row of A, which we may also 
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refer to by Any, similarly, we may refer to the t^ 1 column as A®. Let rank(A) < min{m, d} be the 
rank of A; typically m 3> d and for concreteness, we will assume that rank(A) = d (all the results 
easily generalize to rank(A) < d). For matrices, we will use the spectral norm, || • ||; On occasion, 
we will use the Frobenius norm, || • \\ F . For vectors, || • ||^ = || • || (the standard Euclidean norm). 
The stable, or "soft" rank, p(A) = || A f F /\\ A f < rank(A). 
The singular value decomposition (SVD) of A is 

A = UaSaVI. 

where is an m x d set of columns which are an orthonotmal basis for the column space in 
A; Sa is a d x d positive diagonal matrix of singular values, and V is a d x d orthogonal matrix. 
We refer to the singular values of A (the diagonal entries in Sa) by <7j(A). We will call a matrix 
with orthonormal columns an orthonormal matrix; an orthogonal matrix is a square orthonormal 
matrix. In particular, U a Ua = V a Va = VaV a = Idxd- It is possible to extend Ua to a full 
orthonormal basis of M. m , [Ua, Ua]. 

The SVD is important for a number of reasons. The projection of the columns of A onto the 
k left singular vectors with top k singular values gives the best rank-A; approximation to A in the 
spectral and Frobenius norms. The solution to the linear regression problem is also intimately 
related to the SVD. In particular, consider the following minimization problem which is minimized 
at w*: 

Z* = min I Aw — y || 2 . 

w 

Lemma 1 (jOolub and van Loan! (Il99fih ). Z* = || U^(U^) T y f, and w* = VaS^U^. 



Row-Sampling Matrices Our focus is algorithms based on row-sampling. 
matrix Q G R rxm samples r rows of A to form A = QA: 



A row-sampling 



Q 



A = QA 



~r\A~ 






f r A_ 







where rj = A^.e^. ; it is easy to verify that the row rJA samples the v- row of A and rescales it. We 
are interested in random sampling matrices where each is i.i.d. according to some distribution. 
Define a set of sampling probabilities pi, . . . ,p m , with Pi > and YliLiPi = 1; then rj = e t / ^/rp t 
with probability pt- Note that the scaling is also related to the sampling probabilities in all the 
algorithms we consider. We can write Q T Q as the sum of r independently sampled matrices, 



where r^rj is a diagonal matrix with only one non-zero diagonal entry; the t diagonal entry 
is equal to 1/pt with probability pt- Thus, by construction, for any set of non-zero sampling 
probabilities, Efr^rJ] = I mX m- Since we are averaging r independent copies, it is reasonable to 
expect a concentration around the mean, with respect to r, and so in some sense, Q T Q essentially 
behaves like the identity. 
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1.2 Statement of Results 



All the results essentially follow from the following two basic lemmas on how orthonormal subspaces 
behave with respect to the row-sampling. These are discussed more thoroughly in Section [31 but 
we state them here sumararily. 

Theorem 2 (Symmetric Orthonormal Subspace Sampling). Let U G ]j mxd ft e orthonormal, and 
S G M. dxd be positive diagonal. Assume the row-sampling probabilities pt satisfy 

uJS 2 u t 

Then, if r > (Ap(S)/0e 2 ) In ™, with probability at least 1-5, 

I S 2 - SU T Q T QUS I < e|| S || 2 



We also have an asymmetric version of this result, which is actually obtained through an appli- 
cation of this result to a composite matrix. 

Theorem 3 (Asymmetric Orthonormal Subspace Sampling). Let W G R mxdl , V G R mxd 2 ^ e 
orthonormal, and let Si G R dlXdl and S2 G W l2Xd2 be two positive diagonal matrices; let p^ = p(Sj). 
Consider row sampling probabilities 

Pi + 92 

If r > (8(pi + P2)/I3e 2 ) In 2 ( dl + d2 ) ; f nen yjiffr probability at least 1 — 5, 

I SiW T VS 2 - SiW T Q T QVS 2 I < e|| Si |||| S 2 || 



We note that these row sampling probabilit ies are not the usual product row sampling proba- 
bilities one uses for matrix multiplication as in brineas et ail f|2006al l. Unfortunately, there is one 



small problem with these probabilities. As can be seen, they require some knowledge regarding the 
spectral norms of Sj. In the statement of these results, since the Sj are given diagonal matrices, it 
is easy to compute || S i \ . In the application of these results to matrix multiplication, the spectral 
norm of the input matrices will appear. We will show how to handle this issue later. We now give 
some applications of these orthonormal subspace sampling results. 

Theorem 4 (Matrix Multiplication in Spectral Norm). Let A G R mxdl and B G M mxd2 have 
rescaled rows = a^/f A || and ht = ht\\ B || respectively. Let pa (resp. pb ) be the stable rank of A 
(resp B). Obtain a sampling matrix Q G ]R rxm using row-sampling probabilities pt satisfying 

Pt > * — * , l — = 8 1 1 . 

" E^ia t T a t + b^b t H PA + PB 

Then, if r > 8(p ^+/ b) ln 2(dl + d2) , with probability at least 1-5, 

I A T B- A T B II < e|| A II B ||. 
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Theorem 5 (Sparse Row-Based Matrix Reconstruction). Let A have the SVD representation A 
USV T , and consider row-sampling probabilities pt satisfying p t > ^uju t . Then, if r > (4(d 
j3)/ j3e 2 ) In with probability at least 1 — 5, 

/ii \ 1/2 

|| A - Aflfe ]| < ( — -) |A-A fc |, 

for k = 1, . . . ,d, where fi^ projects onto the top k right singular vectors of A.. 

Theorem 6 (Relative Error £2 Regression). Let A € M. mxd have the SVD representation A 
USV T , and let y G R m . Let x* = A+y be the optimal regression with residual e = y — Ax* 
y — AA + y. Assume the sampling probabilities pt satisfy 




For r > (8(d + l)/(3e 2 ) In 2 ( rf + 1 ) ; \ e t x = (QA) + Qy be the approximate regression. Then, with 
probability at least 1 — 35, 



II Ax - y || < ^i + e + e ^l±fj |Ax*-y|. 

In addition to sampling according to u| we also need the residual vector e = y — AA + y. 
Unfortunately, we have not yet found an efficient way to get a good approximation (in some form 
of relative error) to this residual vector. 

1.2.1 Approximating the Sampling Probabilities 

We show how to construct approximate sampling probabilities efficiently so that the algorithms 
may run in sub-SVD time. The details are given in Section [7| 

• • 11 2 ii ii 2 

Matrix Multiplication The sampling probabilities depend on || A || and || B || . It is possible 
to get a constant factor approximation to || A || 2 (and similarly || B || 2 ) with high probability. First 
sample A = QA according to probabilities pt = a 2 /|| A \ F . These probabilities are easy to compute 
in 0(mdi). By an application of the symmetric subspace sampling theorem (Theorem [20]) . if 
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f > (4/Oa/c 2 ) In t 1 , then with probability at least 1 — 5, 



(l-e)||A|| 2 < I A T A || < (l + e)||A|| 2 . 

We now run Q(ln^) power iterations starting from a random isotropic vector to estimate the 
spectral norm of A T A. The efficiency is 0(md\ + pAd\ In 2 tj-). 

Theorem 7. With r > (4p A /e 2 ) In the spectral norm estimate a\ obtained after cln-r power 
iterations on A A starting from an isotropic random vector satisfies 

1 "A|| 2 <a 2 <-||A|| 2 . 



Further, the estimate a\ can be computed in 0(mdi + pAd\ In 2 ^). 

Note that we only need a constant factor approximation to the spectral norm to get a con- 
stant factor approximation to the probabilities, which is all we need for the matrix multiplication 
algorithm to maintain the same asymptotic efficiency. 
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Sparse Row-Based Matrix Reconstruction We use random embeddings via Fast Johnson- 
Lindenstrauss Transformations to approximate the probabilities p% = u 2 jd. In particular, we show 
that it is possible to approximately sample in o(md 2 ) according to the leverage scores. While the 
efficiency savings are not heroic, this is (to our knowledge) the first sub-SVD algorithm to sample 
according to the leverage scores. 

Theorem 8. There is an algorithm which constructs estimates pt such that with probability at least 
1-5 



^ { — i 2 J 

clog m a 

for t = 1, . . . , m; the running time is in O (^rndlog m + (md log d + log . 

We note that we have a polylog factor approximation to the probabilities; this results in a 
polylog bloat to the efficiencies of all the matrix algorithms. Nevertheless, for m = o(e d ), the 
algorithms remain sub-SVD. 

1.3 Paper Outline 

Next we describe some probabistic tail inequalities which will be useful. We continue with the 
sampling lemmas for orthonormal matrices, followed by the applications to matrix multiplication, 
matrix reconstruction and ^-regression. Finally, we discuss the algorithms for approximating the 
sampling probabilities efficiently. 



2 Probabilistic Tail Inequalities 

Since all our arguments involve high probability results, our main bounding tools will be probability 
tail inequalities. First, let X\, . . . , X n be independent random variables with EpQ] = and |X,| < 
7; let Z n = - Y^Ii=\ Xi- Chernoff, and later Hoeffding gave the bound 

Theorem 9 (jChernofl] (|l952h : iHoeffdind (|l963h ). P[|Z n | > e] < 2e~ ne '' W '. 

If in addition one can bound the variance, K[Xf] < s 2 , then we have Bernstein's bound: 

Theorem 10 dBernsteinl (|l924h ). P[\Z n \ > e] < 2e- ne2/{2s2+2 ^ t/3) . 

Note that when e < 3s 2 /7, we can simplify the Bernstein bound to P[|Z n | > e] < 2e~ ne2 ^ s2 , 
which is considerably simpler and only involves the variance. The non-commutative versions of these 
bounds, which extend these inequalities to matrix valued random variables can also be deduced. Let 
Xi, . . . ,X n be independent copies of a symmetric random matrix X, with E[X] = 0, and suppose 
that I X || 2 < 7; let Z„ = I Y^=i x i- lAhlswede and Winter! (|2002l ) gave the fundamental extension 
of the exponentiation trick for computing Chernof f bounds of scalar random variables to matrix 
valued random variables (for a simplified proof, see Wigderson and Xiao ( 20081 )): 



Z n 1 2 > e] < inf 2de- net l~<\\ E [e tx ^] f™. 



(1) 



By standard optimization of this bound, one readily obtains the non-commutative tail inequality 
Theorem 11 dAhlswede and Winter! <|2002h ). P[|| Z n || 2 > e] < 2de~ ne2/ ^ 2 . 
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Proof. The statement is trivial if e > 7, so assume e < 7. The lemma follows from ([T]) and the 
following sequence after setting t = e/27 < i: 

II E [e' x ^] | 2 < 1 + E ^ E [|| (X/7)* || 2 ] < 1 + t 2 < / , (2) 

where (a) follows from E[X] = 0, the triangle inequality and || E [•] || 2 < E[|| • || 2 ]; (b) follows because 
||X/7|| 2 < 1 and t < \. ■ 

(We have stated a s implified ver s ion o f the bound, without taking care to optimize the con- 
stants.) As mentioned in iGross et all h00$ ). one can obtain a non-commuting version of Bernstein's 



inequality in a similar fashion using (pQ). Assume that || E X T X || 2 < s 2 . By adapting the standard 
Bernstein bounding argument to matrices, we have 



Lemma 12. || E [e' x/7 ] || 2 < exp ( 4j(e* - 1 - t) 
Proof. As in (|2|), but using (via submultiplicativity) || (X/7)^ || 2 < s 2 j E ~ 2 /^ e = s 2 /"/ 2 , 

2 00 +1 2 

I ne tXh \ | 2 < 1 + = l + ^(e'-l-t)< exp (£( e *-l_t)). 

7 „__ 2 «■ 7 



Using Lemma [T2l in (P) with t = ln(l + ej/s 2 ), and using (1 + x) ln(l + ^) — 1 > 2x +2/3 ' we 
obtain the following result. 

Theorem 13 (Non-commutative Bernstein). P[|| Z n || 2 > e] < 2(fe~ ne2/(2s2+27e/3) . 

Gross et al\ ()2009h gives a simpler version of the non-commutative Bernstein inequality. If 



X E ]R rf i xrf 2 j s no ^ symmetric, then by considering 

Odixdi X 
X T 0d 2 xd 2 _ 

one can get a non-symmetric verision of the non-commutative Chernoff and Bernstein bounds, 
Theorem 14 (jRechtJ (fiooil ^. P[|| Z n |L > e] < (di + d 2 ) e -^7(2s 2 +2 7 73) _ 



For most of our purposes, we will only need the symmetric version; again, if e < 3s 2 '7, then we 
have the much simpler bound P[|| Z n || 2 > e] < 2de~ ne / 4s . 

3 Orthonormal Sampling Lemmas 

Let U E U mxci be an orthonormal matrix, and let S E M. dxd be a diagonal matrix. We are interested 
in the product US E PJ nxc '; US is the matrix with columns XJ^Su. Without loss of generality, 
we can assume that S is positive by flipping the signs of the appropriate columns of U. The 
row-representation of U is U T = [ui, . . . ,u m ]; we consider the row sampling probabilities 

P( >ft u?s2ui 



trace (S ) 

Since U T U = Idxd, one can verify that trace(S 2 ) = ^t u ?S 2 u t is the correct normalization. 
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Lemma 15 (Symmetric Subspace Sampling Lemma). 
P[||S 2 -SU T Q T QUS|| >e||S|| 2 ] < 2d ■ exp 

< 2d ■ exp 



-re 2 



2(j,/p- K -* + e(p/P-K-*)/3) 
-r/3e 2 



4p 

where p is the numerical (stable) rank of S, p(S) = || S \\ 2 F /\\ S || 2 , and k is the condition number, 
k(S) = cr max (S)/cr min (S). 

Remarks. The stable rank p < d measures the effective dimension of the matrix. The condition 
number k > 1, hence the simpler version of the bound, which is valid for e < 3. It immediately 
follows that if r > (Ap/Pe 2 ) In then with probability at least 1 — 5, 

I S 2 - SU T Q T QUS I < e|| S || 2 

An important special case is when S = Idxd, in which case p = d, k = 1 and || S || = 1. 

8 T 

d u t ] 

-Pre 2 



Corollary 16. For sampling probabilities pt > 4ujuf, 



I - U T Q T QU I > e] < 2d ■ exp 



4(d-0)J' 



Proof, (of Lemma [To]) Note that U T Q T QU = £ Yll=i n U u tJPti> where tj G [1, m] is chosen accord- 
ing to the probability pf. . It follows that 

S 2 - SU T Q T QUS = - y S 2 - — SutX.S = - V Xi, 
r i Pu 1 T r-f 

where X, are independent copies of a matrix-random variable X ~ S 2 — Suu T S 2 /p. We prove the 
following three claims: 

(i) E [X] = 0; 

(«) ||X|| < ||S|| 2 (p/^-«- 2 ); 
(n) I EX T X|| < \\§\\\p/p- 

2 

The Lemma follows from the non-commutative Bernstein bound with e replaced by e|| S || . To 
prove (i), note that E[X] = S 2 — S E [uu T /p]S = S 2 — S (X^t=i U * U D S = 0, because YlT=i u * u * = 
u T u = l dxd . 

To prove (ii), let z be an arbitrary unit vector and consider 



z 



T Xz = z T S 2 z - -(z T Su) 2 . 



P 

It follows that z T Xz < || S || 2 . To get a lower bound, we use p > /3u T S 2 u/trace(S 2 ): 

trace(S 2 ) (z T Su) 2 



z T Xz > z T S 2 z 



2„ ' 



P u T S 2 u 

> I si 2 f ^*^ trace(s 



SI 2 pwsf 



SI 2 



K 2 pj' 



s 



(a) follows because: by definition of a m - m , the minimum of the first term is <7 ^ in ; and, by Cauchy- 
Schwarz, (z T Su) 2 < (z T z)(u T S 2 u). Since < 1, p/(3-k~ 2 > 1, and so |z T Xz| < || S || 2 (p/fi - k~ 2 ), 
from which (ii) follows. 

To prove (Hi), first note that 

E[X T X] = S 4 -S 3 E[uu T /p]S-SE[uu T /p]S 3 + SE[uu T S 2 uu7p 2 ]S, 

^^-u^S 2 u t ujJ S-S 4 . 

(a) follows because E[uu T /p] = I. Thus, for an arbitrary unit z, we have 

m 1 

z T E[X T X]z = y-(z T SiWSz)uTS 2 u t -z T S 4 z T , 

tip* 

(«) trace(S 2 ) zTg ^ ^ _ 

(6) , „ l|4 Arace(S 2 ) z T S 2 z T z T S 4 z 



IIS 



Arace(S 2 ) z T S 2 z T z T S 4 z T \ 

V /3||S|| 2 js] 2 ||S|| 4 J 
4 / trace(S 2 ) _ crj in \ 

V /3||S|| 2 " ||S |V ' 



(a) follows from p t > /3uJS 2 u t /trace(S 2 ); (b) follows from U T U = YltLi u * u ? = Irfxd- Thus, 
|z T E [X T X]z| < I S t{p /P - k" 4 ), from which (in) follows. ■ 

For the general case, consider two orthonormal matrices W G IR mX(il , V £ M mxrf2 , and two 
positive diagonal matrices Si E ^ d i xrf i anc i g 2 g ^^2xd 2 _ yy e consider the product SiW T VS2, which 
is approximated by the sampled product SiW T Q T QVS2- Consider the sampling probabilities 

KS 2 u t )V 2 KS 2 v,) 4 / 2 KS 2 u f )V 2 KS 2 v t )V 2 

Pt -^E^Ks 2 u,)V 2K s 2 v f )V 2 -^ / trace(s?)trace(si) ' 



where the last inequa lity follows from Cau chy-Schwarz, Since ||A||^ = \J p(A)|| A \ > \\A\\, one 
immediately has from lDrineas et al\ (|2006al ) (using a simplified form for the bound), 

P [|| SiW T VS 2 - SiW T Q T QVS 2 I > e|| Si || || S 2 ||] < exp ( ^-^) , (3) 



16pip 2 

where pi = p(S\) and p2 = /?(S 2 ). Alternatively, if r > (16pi/9 2 //3 2 e 2 ) In j, then 

I SiW T VS 2 - SiW T Q T QVS 2 I < e|| Si |||| S 2 ||. 

The dependence on the stable ranks and /3 is quadratic. Applying this bound to the situation 
in Lemma [15] would give an inferior bound. The intuition behind the improvement is that the 
sampling is isotropic, and so will not favor any particular direction. One can therefore guess that 
all the singular values are approximately equal and so the Frobenius norm bound on the spectral 
norm will be loose by a factor of y/p; and, indeed this is what comes out in the closer analysis. As 
a application of Lemma [T5l we can get a result for the asymmetric case. 



9 



Lemma 17. Let W G M mxdl , V E ]R mxrf 2 ft e orthonormal, and let Si G R d ^ xd ^ and S 2 G R<fe>«fe 
be two positive diagonal matrices. Consider row sampling probabilities 

Pi + 92 

Iff > (8(pi + p 2 )//3e 2 ) In 2 ( dl + da ) ) i/j en probability at least 1 — 5, 

I SiW T VS 2 - SiW T Q T QVS 2 I < e|| Si |||| S 2 || 
For the special case that Si = Id lX di and S 2 = Lj 2 x<2 2 > the sampling probabilities simplify to 



Pt>P 



wjw t + vjv t 
d\ + <i 2 



Corollary 18. If r > (8(dx + d 2 )//3e 2 )ln 2 ( dl + d2 \ then with probability at least 1-5, 



6 

I W T V - W T Q T QV I < e. 

Proof, (of Lemma [T7|) By homogeneity, we can without loss of generality assume that || Si 
I S 2 I = 1, and lei0 Z = [WSi VS 2 ]. An elementary lemma which we will find useful is 

Lemma 19. For any matrix A = [Ai A 2 ], 



I Ax I, I A 2 ||}<||A|| < VIlAi || 2 + || A 2 || 2 . 

The left inequality is saturated when Ai and A 2 are orthogonal (A^A 2 = 0), and the right 
hand inequality is saturated when Ai = A 2 . By repeatedly applying Lemma \W\ one can see that 
I A I is at least the spectral norm of any submatrix. Introduce the SVD of Z, 

Z = [WSi VS 2 ] = USV T . 

We use our typical row sampling probabilities according to US, 

Pt > P- 



trace(S^ 



We may interpret the sampling probabilities as follows. Let zt be a row of Z, the concatenation of 
two rows in WSi and VS 2 : zj = [w^Si vJS 2 ]. We also have that zj = u^S. Hence, 

ujS 2 u t = zjz t = w[S 2 w t + vjS 2 v t . 

These are exactly the probabilities as claimed in the statement of the lemma (modulo the rescaling) . 
Applying Lemma [151 if r > (4/?//3e 2 ) In 2 rai ^( u ) ; then with probability at least 1 — 5, 



I S 2 - SU T Q T QUS I < e|| S || 2 < eyj || Si || 2 + || S 2 || 2 = eV2, 
where the second inequality follows from Lemma [19j Since ZV = US, 



Z T Z - Z T Q T QZ I = I S 2 - SU T Q T QUS ||. 



lr The general case would have been Z = J j ^ | WSi pjjVS; 
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Further, by the construction of Z, 
Z T Z - Z T Q T QZ 



Sf - SiW T Q T QWSi SiW T VS 2 - SiW T Q T QVS 2 
S 2 V T WSi - S 2 V T Q T QWSi S 2 , - S 2 V T Q T QVS 2 



By Lemma! || SiW T VS 2 - SiW T Q T QVS 2 || < || Z T Z - Z T Q T QZ ||, and so: 

I SiW T VS 2 - SiW T Q T QVS 2 I < eV2. 

Observe that trace(S 2 ) = || Z \\ 2 F = trace(S 2 ) + trace^ 2 .); further, since || S || > max{|| Si ||, || S 2 ||}, 
we have that 

trace(S 2 ) trace(Sf) + trace(S 2 ) trace(Sf) trace(S 2 ) 

P( S > = || g ||2 = j^-p ^ || Si || 2 + || g 2 || 2 = Pi + 

Since rank(U) < d\ + d 2 , it suffices that r > (A(p x + p 2 )//3e 2 ) In 2( - dl + d ^ to obtain error ey/2; after 
rescaling e' = e\/2, we have the result. ■ 

4 Sampling for Matrix Multiplication 

We obtain results for matrix multiplication directly from Lemmas [15] and [T71 First we consider the 
symmetric case, then the asymmetric case. Let A € W nxdl and B € W nxd2 . We are interested 
in conditions on the sampling matrix Q £ R rxm such that A T A ~ A A and A T B ~ A B, where 
A = QA and B = QB. Using the SVD of A, 

I A T A - A T Q T QA I = I VaSaUIUaSa V\ - V A S A UiQ T QU A SA V A || , 
= I S 2 A - SaU a Q t QUaSa I- 

We may now directly apply Lemma HU with respect to the appropriate sampling probabilities. One 
can verify that the sampling probabilities in Lemma [15] are proportional to the squared norms of 
the rows of A. 

Theorem 20. Let A G W mxdl have rows at Obtain a sampling matrix Q G W xm using row- 
sampling probabilities 

Pt > Pyvw- 

II A \\f 

Then, if r > ^-ln^p, with probability at least 1 — 5, 

I A T A- A T A|| < e||A|| 2 . 

Similarly, using the SVDs of A and B, 

I A T B — A T Q T QB I = |VaSaU1UbSbV^- V a SaUIQ t QU b SbV^||, 
= I SaU a UbSb — SaU a Q t QUbSb ||. 

We may now directly apply Lemma [T71 with respect to the appropriate sampling probabilities. One 
can verify that the sampling probabilities in Lemma [T7] are proportional to the sum of the rescaled 
squared norms of the rows of A and B. 
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Theorem 21. Let A £ R mxci i and B £ ]R mxd2 ; aaue rescaled rows k t = a t /\\ A || and b t = b t || B 
respectively. Obtain a sampling matrix Q £ ]R rxm using row-sampling probabilities 



i a t T a t + b^b t PA + PB 



JTien, i/r > s{pa+ / b) ln 2(dl + d2) , with probability at least 1 - 5, 



A T B- A T B II < ell A II B 



5 Sparse Row Based Matrix Representation 

Given a matrix A = USV T £ W mxd , the top k singular vectors, corresponding to the top k singular 
values give the best rank k reconstruction of A. Specifically, let Afc = UfcSfcVfc, where Ufc £ R mxfc , 
S fe £ R kxk and V k £ R dxk ; then, || A — A fc || < || A — X || where X £ R mxd ranges over all rank-A; 
matrices. As usual, let A = QA be the sampled, rescaled rows of A, with A = USV , and consider 
the top-A; right singular vectors Let 11^ be the projection onto this top-k right singular space, 
and consider the rank k approximation to A obtained by projecting onto this space: A& = ALTfe. 
The following lemma is useful for showing that A^ is almost (up to additive error) as good an 
approximation to A as one can get. 

Lemma 22 ( Drineas et al. ( 2006bl ) . Rudelson and Vershvnin ( 2007 )). 



I A - A fc || 2 < I A — Afc || 2 + 2|| A T A - A T A || < (|| A — A fc || + V2\\ A T A - A T A || 1/2 ) 2 - 

Proof. The proof follows using standard arguments and an application of a perturbation theory 
result due to Weyl for bounding the change in any singular value upon hermitian perturbation of 
a hermitian matrix. ■ 

Therefore, if we can approximate the matrix product A T A, we immediately get a good recon- 
struction for every k. The appropriate sampling probabilities from the previous section are 

Pt>p 



„ A II 2 ' 

II A \\F 

In this case, if r > (4o//3e 2 ) In then with probability at least 1 — 5, 

|| A - A fc || 2 < I A- Afc|| 2 + 2e||A|| 2 . 

The sampling probabilities are easy to compute and sampling can be accomplished in one pass if 
the matrix is stored row-by-row. 

To get a relative error result, we need a more careful non-uniform sampling probabilities. The 
problem here becomes apparent if A has rank k. In this case we have no hope of a relative error 
approximation unless we preserve the rank during sampling. To do so, we need to sample according 
to the actual singular vectors in U, not according to A; this is because sampling according to A 
can give especially large weight to a few of the large singular value directions, ignoring the small 
singular value directions and hence not preserving rank. By sampling according to U, we essentially 
put equal weight on all singular directions. To approximate U well, we need sampling probabilities 

. T 
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Then, from Corollary [TBI if r > (4(d — f3)//3e 2 ) In with probability at least 1 — S, 

1 1 - U T Q T QU I < e. 

Since || U || = 1, it also follows that 

I UU T - UU T Q T QUU T I < e. 
This result is useful because of the following lemma. 



Lemma 23 (jSpielman and Srivastaval (|2008h b If || UU T - UU T Q T QUU T || < e, then for every 
x G R d , 

(1 - e)x T A T Ax < x T A T Ax < (1 + e)x T A T Ax. 



Proof. We give a sketch of the proof from Spielman and Srivastava ( 20081 ). We let x ^ range 



over col(U). Since col(U) = col(A), x G col(U) if and only if for some y G R , x = Ay. Since 
rank(A) = d, Ay 7^ <J=^ y / 0. Also note that UU T A = A, since UU T is a projection operator 
onto the column space of U, which is the same as the column space of A. The following sequence 
establishes the lemma. 

II ttttt tttt t o t otttt t II |x T UU T x-x T UU T Q T QUU T x| 
|UU -UU Q QUU I = sup , 



|y T A T UU T Ay - y T A T UU T Q T QUU T Ay| 

= SU P > 

Ay^O y T A Ay 

|y T A T Ay - y T A T Q T QAy| 

= sup —j- , 

Ay^o y T A Ay 

|y T A T Ay - y T A T Ay| 

= SU P i^-TYT , 

y ^o y T A Ay 

The lemma now follows because || UU T - UU T Q T QUU T || < e. ■ 

Via the Courant-Fischer characterization of the singular values, it is immediate from Lemma [23] 
that the singular value spectrum is also preserved : 

(1 - e)^(A T A) < ^(A T A) < (1 + e)(7j(A T A). (4) 

Lemma [23] along with (j3J) will allow us to prove the relative approximation result. 

Theorem 24. If p t > ^uju t and r > (4(d - P)/pe 2 ) In 2 f, then, for k = 1, . . . , d, 

|| A — An fe || < (— -) |A-A fc |, 
where Hk projects onto the top k right singular vectors of A. 
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1 /2 

Remarks. For e < i, ( ) < 1 + 2e. Computing the probabilities involves knowing 



2' V 1 ^ 6 , 

which means one has to perform an SVD, in which case, one could use A^; it seems like overkill 
to compute A*, in order to approximate A&. We discuss approximate sampling schemes later, in 
Section [JJ 

Proof. Let || x || = 1. The following sequence establishes the result. 

||A(I-n fc )|| 2 = sup ||Ax|| 2 = sup x T A T Ax, 

xeker(ri fc ) xeker(ri fe ) 

1 ~ T ~ 
< sup x T A Ax, 

1 ~ 6 xeker(fi fe ) 
<j fc+1 (A ? A), 



1 - e 

< ^a fc+1 (A T A) = i±^||A-A fc || 2 . 
1 — e 1 — e 



6 £2 Linear Regression with Relative Error Bounds 

A linear regression is represented by a real data matrix A £ ]^ mxd which represents m points in 
M. d , and a target vector y £ M m . Traditionally, m ^> d (severly over constrained regression). The 
goal is to find a regression vector x* E R 2 which minimizes the £2 fit error (least squares regression) 

m 

£(x) = I Ax-y|| 2 = ^(a f T x-y t ) 2 , 

We assume such an optimal x* exists (it may not be unique unless A has full column rank), and is 
given by x* = A + y, where + denotes the More-Penrose pseudo-inverse; this problem can be solved 
m 0(md 2 ). Through row-sampling, it is possible to construct x, an approximation to the optimal 
regression weights x*, which is a relative error approximation to optimal, 

£(x) < (l + e)£(x*). 

As usual, let A = UaSaV^. Then A + = VaS^U^, and so x* = VS~ 1 U T y. The predictions are 
y* = Ax* = UAU A y, which is the projection of y onto the column space of A. We define the 
residual e = y-y*=y - Ax* = (I - U A U A )y, so 

y = U A U A y + e. (5) 

We will construct A and y by sampling rows: 

[A,y] = Q[A,y], 

and solve the linear regression problem on (A,y) to obtain x = A + y. For j3 G (0, |], we will use 
the sampling probabilities 
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to construct A and y. There are three parts to these sampling probabilities. The first part allows 
us to reconstruct A well from A; the second allows us to reconstruct A T e; and, the third allows us 
to reconstruct e. 

Note that A = QUaSaVa t ; if QUa consisted of orthonormal columns, then this would be 
the SVD of A. Indeed, this is approximately so, as we will soon see. Let the SVD of A be 
A = U A S A V A T . Let U = QU A . Since p t > (3uf/d, it follows from Corollary M that if r > 2^=#, 
for e G (0, 1), then, with high probability, 

1 1 - U T U I < e. 

Since the eigenvalues of I — LJ T U are given by 1 — cr?(U), it follows that 

1 - e < cr?(U) < 1 + e. 

So all the singular values of Ua are preserved after sampling. Essentially, it suffices to sample 
r = 0(dlnd/e 2 ) rows to preserve the entire spectrum of Ua- By choosing (say) e = ^, the 
rank of Ua is preserved with high probability, since all the singular values are bigger than |. 
Thus, with high probability, rank(A) = rank(UA) = rank(QUA) = rank(UA) = rank(A). Since 
QUa has full rank, SqJj a is defined, and Squ a — SqJj a is a diagonal matrix whose diagonals are 

(of (U) — l)/cij(U); thus, I Squ a — SqJj a || < e/y/l — e. This allows us to quantify the degree to 
which QUa is orthonormal, because 

|(QU A ) + - (QU A ) T | 2 = II VquaSq^Uqu/ - V QUa Squ a U^ Ua || 2 

= I S QU A - S QUa ll 2 < ^f=- 

Finally, we can get a convenient form for A + = (QA) + , because QA = QUaSaV a has full rank, 
and so QUa = Uqtj a Squ a Vq Ua has full rank (and hence is the product of full rank matrices). 
Thus, 

(Uqu a Sq Ua Vq Ua S a V a ) + , 
Va(Squ a Vq Ua S a ) + Uq Ua , 

VaS^VquaSquaUqua' 

v a sx 1 (Qu a ) + , 



(QA)+ = 



We summarize all this information in the next lemma. 

Lemma 25. If r > (4d//3e 2 ) In with probability at least 1 — 5, all of the following hold: 

rank(A) = rank(U A ) = rank(QU A ) = rank(U A ) = rank(A); (7) 

I S QUa " S QIJa l 2 < e/VT^~e; (8) 

I (QU A ) + - (QU A ) T || 2 < e/VT^~e; (9) 

(QA)+ = V A SX 1 (QU A ) + . (10) 



In Lemma 1251 we h ave simplified the constant to 4; this is a strengthened form of Lemma 4.1 in 



Drineas et all (|2006dh : in particular, the dependence on d is near-linear. 
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Remember that x = A y; we now bound || Ax 



basically follows the line of reasoning in iDrineas et al. 
with probability at least 1 — 6, 



We only sketch the derivation which 
Under the conditions of Lemma 



Ax 



(a) 

(b) 

(c) 



(d) 
< 

(e) 
< 



AA + y-y|| = || A(QA)+Qy - y || 
U A (QU A ) + Qy-y|| 

U A (QU A )+Q(U A U A y + e) - U A U A y - e || 
U A (QU A )+Qe - e || 

U A ((QU A )+ - (QU A ) T )Qe + U A (QU A ) T Qe - e 
+ - (QU A ) T If Qe I + I U A Q T Qe || + || e || 
Qe|| + || U A Q T Qe|| + ||e||. 



(QU A 
e 



VT- 

(a) follows from Lemma l25| (b) follows from ([5]); (c) follows Lemma l25| because QU A has full 
rank and so (QU A ) + QU A = 1^; (d) follows from the triangle inequality and sub-multiplicativity 
using I U A || = 1; finally, (e) follows from Lemma l25l We now see the rationale for the complicated 
sampling probabilities. Since pt > e|/e T e, for r large enoug h, by Theorem [201 I Qe I < Nl (1 + e). 
Similarly, since U A e = 0, || U A Q T Qe || = || U A e — U A Q T Qe ||; so, we can apply Lemma [T7I with 
Si = Lj, V = e/|| e || and S2 = || e||. According to Lemma [T71 if pt > /?(u| + ef/e T e)/(c? + 1), then 
if r is large enough, || U A Q T Qe || < e|| e ||. Since these are all probabilistic statements, we need to 
apply the union bound to ensure that all of them hold. Ultimately, we have: 



Theorem 26. For sampling probabilities satisfying and for r > (8(d + l)//3e 2 )ln 
x = (QA) + Qy be the approximate regression. Then, with probability at least 1 — 35, 



2{d+l) 



let 



Ax - y I < 1 + e + e 



1 + e 



Ax* 



where x* = A + y is the optimal regression. 



Remarks. For the proof of the theorem, we observe that any transformation matrix Q satisfying 
the following three properties with high probability will do: 

I - U T Q T QU I < e; (ii)\\ Qe | < (1 + e)| e ||; (m)|| U T Q T Qe || < e|| e ||. 

We will see later that Johnson-Lindenstrauss transforms also satisfy this property, and hence can 
also be used to perform approximate linear regression with relative error guarantees. 



7 Approximating the Sampling Probabilities 

We have encountered a variety of row sampling probabilities (actually, conditions which the prob- 
abilities need to satisfy). Once we can compute probabilities satisfying these conditions, it is 
relatively straightforward to sample rows according to these probabilities. We now discuss how to 
efficiently approximate such probabilities. 
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7.1 Matrix Multiplication 



The row-norm based sampling is relatively straightforward for the symmetric product. For the 
asymmetric product, A T B, we need probabilities 

^ a iW aJat + JW hJht 
Pt > p— - J! -^ . 

Pa + Pb 

To get these probabilities, we need || A || and ||B||; since we can compute the exact product in 
0(m(f 1^2), a practically useful algorithm would need to estimate || A || and || B || efficiently. Suppose 
we had estimates Aa,Ab which satisfy: 

(1 - 6)|| A || 2 < A| < (1 + 6)|| A || 2 ; (1 - 6)|| B || 2 < A 2 < (1 + e)| B || 2 . 

We can construct probabilities satisfying the desired property with /3 = (1 — e)/(l + e). 



Pt 



> 



A\\ 2 F /Xl + \\B\\ 2 F /Xl 
A|| 2 ,/(1-6)||A|| 2 + ||B|| 2 ,/(1-6)||A|| 2 



i + ey PA + PB 

11 2 

One practical way to obtain || A || is using the power iteration. Given an arbitrary unit vector 
xo, for n > 1, let x n = A T Ax n ,„i/|| A T Ax„,_i ||. Note that multiplying by A T A can be done in 
0(2mdi) operations. By definition, || A T Ax n || < || A || 2 . We now get a lower bound. Let xq be 
a random isotropic vector constructed using d\ independent standard Normal variates z±, . . . , z^ x \ 

so Xq = [zi , . . . , Zfa ] I \j z\ + • • • + . Let A 2 = A T Ax n be an estimate for ||A|| 2 after n power 
iterations. 

Theorem 27. For some constant c < (- + 2) 3 , with probability at least 1 — 5, 



A 2 > 



All 2 



4 + ^-2"™ 



Remarks, n > log -4- gives the desired constant factor approximation. Thus in O(mdiln^) 
time, we obtain a sufficiently good estimate for || A || (and similarly for || B ||). 

Proof. Assume that xo = Ylt=i a i w ii where Vj are the eigenvectors of A T A with corresponding 
eigenvalues o\ > ••• > cj^. Note ||A|| 2 = o\. If cr^ > cj 2 /2, then it trivially follows that 
I A T Ax n I > o~f/2 for any n, so assume that a\ < o-f/2. We can thus partition the singular values 
into those at least <r 2 /2 and those which are smaller; the latter set is non-empty. So assume for 
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some k < d\, a 2 . > crf/2 and crf, +l < cr\/2. We therefore have: 



> 



(a) 
> 



Edi 2 



4(n+l) 



£—11=1 



EK 2 ^ 



2 4(n+l) 



E* 



2 4(n+l) 



Eli a?of» + E£ fc+ i 



-i 1 



2 af n 
.. \&{n+l) 



a 2 af n 



E 4 =i«i(^/ cr i) 



Eti«f(^M) 4n + Et fe+ i«f(^i) 



\4n ' 



Ek 
i=l 



a 



2 M/^) 4(n+1) 



4E-=i«?(^M) 4(n+1) + 2- 



(6) 

> 



4 + 2-VEti«K^M) 4(n+1) ' 



4 + 2-™/af 



2 ' 



(a) follows because for i > k + 1, erf < af/2; for i <= k, erf /erf < 4; and Ei>fc+i a i — Ei>i a i = !• 

(b) follows because Ei=i a f ( cr i/ (J i) 4( ' n+1 ' ) > a i- The theorem will now follow if we show that with 
probability at least 1 
of generality, assume vi 
standard normals). For 



cS 1 / 3 , af > 5/d. It is clear that E[a 2 ] = 1/d from isotropy. Without loss 
is aligned with the z\ axis. So a 2 = z\ / ^ zf (z±, . . . , z^ are independent 
5 < 1, we estimate P[a 2 > 5/d] as follows: 



9 ^ 



> 



(a) 



(6) 

> 



4 > s_ 
E^-d 

5 



zi > 



E 



i>2 



tE 



i>2 



2 > 5 v 2 



2 > 5 + <5 2/3 



d- 1 



In (a) we compute the probability that a Xi random variable exceeds a multiple of an independent 
Xd-i random variable, which follows from the definition of the x 2 distribution as a sum of squares 
of independent standard normals, (b) follows from independence and because one particular real- 
ization of the event in (a) is when x\ > ^ + ^ 2//3 anci x\-\ < <5 + 5 2 l 3 . Since ¥.{xfi_ 1 /{d — 1)] = 1, 
and Far[x^_ 1 /(d — 1)] = 2/(d — 1), by Chebyshev's inequality, 



d- 1 



_i < <5 + <5 2/3 



> 1 



From the definition of the xf distribution, we can bound P[xf < <5 + 5 2/3 ], 

2V2r(l/2) Jo 



du u- l ' 2 e- u l 2 < 



(5 + 5 2 ' 3 ) 1 ' 2 , 
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and so 



> 1 



d-1 



> 1 



- + 2 

7T 



We end this section with an alternate sampling based approach to estimate the spectral norm, 
which can be combined with the power iteration. The basic idea is to do a pre-sampling of the 
rows of A according to the row norms to construct A. We know that if r > {Ap^/ j3e 2 ) In then 



A T A 



A T A II < ell A I 



It follows that we have a e-approximation to the spectral norm from 



A'A 
A T A 



A A 
A T A 



A T A + A T A|| < (l + e)||A| 



~ T ~ ~* T ~ 

A A + A A 



< e||A|| 2 + ||A T A 



Thus, ( 1 — e) || A || < I A A I < (1 + e)|| A || . Along this route, one must first sample r rows, and 
then approximate the spectral norm of the resulting A. We may use the power iteration above to 
get a constant factor approximation (which is all we need), or we may compute exaactly in 0{rd\). 



7.2 Sparse Matrix Representation 

The sampling probabilities pt are required to satisfy 

Pl>- d * t . 

At first sight, it appears that we would need to perform an SVD to merely obtain these sam- 
pling probabilities. However, since we only need to approximate these sampling probabilit ies, we 
can le verage some recent results from the use of random projections for matrix algorithms ISarlos 
n a very active current stream of re search, starting with Johnson-Lindenstrauss' origina l 
result on embeddings via random projections (] Johnson and Lindenstrausd . ll984nAchlioptasl . l2003l ). 
all the algorithms described here can be accomplishe d in comparable space and time comple xity 
using fast Johnson-Lindenstrauss transforms, FJLTs, ( Ailon and Chazelle . 20061 : Sarlos . 20061 ). In 
that context, the algorithms achieve their goal by quickly constructing a small number of random 
linear combinations of the rows. Our focus is to quickly construct a small number of actual rows. 
Nevertheless, those techniques which quickly construct a small number of linear combinations are 
usefull for approximating the probabilities which can then be used to construct the desired actual 
rows. 

First, some brief background. Let V = {vi, . . . , v^} where Vj S R m and \V\ = d. A matrix 
R G ]^ rx ' m is a Johnson-Lindenstrauss transform (JLT) for V if, for all x. £ V, 



(l-e)||x|r<||Rx||<(l + e )||x|r. (11) 

We generally assume m 3> d. The seminal result of Johnson and Lindenstraussl ( 1984 ) is that there 
exist such transforms with r = O(^Togd). Further, it is possible to find such matrices efficiently 
using randomized constructions. Specifically , if r = f?(-^ log rflog 4 ), the n a normalized random 
matrix (standard normals or r andom signs (Arriaga and Vempala . 20061 )) will yield such a JLT 
with probability at least 1 — 5. lAilon and Chazelle (|200fSh showed that by preconditioning with a 
randomized Hadamard matrix, a significantly sparser JLT can be constructed, which meant that 
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the matrix multiplications can be computed more efficiently using FFT techniques; we denote this 
a fast Johnson-Lindenstrauss transform, or FJLT. We will use FLJT to refer to a specific FLJT or, 
when the context is clear, to denote a probability distribution over R rxm which generates an FLJT 
with high probability. Specifically, it is possible to construct R G l rxm with r > log (i log j 
such that with probability at least 1 — 5, holds; and, for all v G IR m , computing Rv takes 
0(m log m + rlog 2 d) time. This means that for a matrix A G ]R mxrf , we can construct RA in 
0(md log m + rd log 2 d) time. 



Let A G 



pmxd 



have SVD A = USV T ; we also write U T = [ui, . . . , u m ]. Let e\, . . . , e m be the 
standard basis vectors in M m . The columns of U together with the standard basis vectors form 
a set of m + d vectors. We can project down to Q(\ logjd + m)) dimen sions and still preserve 



the relationships (including angles) between all these vectors Magen ( 20021 ) . Further, if we project 
down to 0(Jjdlog ^) dimensions, then RU remains ne ar orthonorma l. Summarizing this discussion, 
together with using Lemmas 6,10 and Corollary 11 in ISarlosI ()2006h . we have the following lemma. 



Lemma 28. Let R G W xrn be a FJLT, with r > -^(dlog f +log(d+m)) log Then, with probability 
at least 1 — 5, all the following hold: 



(i) For i = 1, . . . 



m. 



I - U T R T RU II < e; 



Re* I < (1 + e); 



eJUU T R T Re, 



eJUU T e, 



< ell U T e, 



(ii) For X G R mxd , RX can be computed in 0(mdlogm + rdlog 2 d); for X G R dxr , XR can be 
computed in 0(mdlog m + rd log 2 d). 

Note that the first part o f (ii) is from Ailon and Chazelle ( 20061 ); the second part also follows 
exactly the same reasoning in lAilon and Chazelld (|2006l ) using the fact that R = PHD, where P is 
sparse, H is Hadamard and D is diagonal. Note that FLJTs can be used to directly perf orm matrix 
multiplication, get sparse matrix representations and do linear regression (jSarlosl . [20061 ) . All these 
algorithms would be in terms of a small number of linear combinations of the rows, not the actual 
rows themselves. The first part of (i) in Lemma [28] is all that one needs for Lemma [251 to hold. 

Lemma 29. Under the assumptions o f Lemma [28[ with probability at least 1 — 5, all of the following 
hold: 



e; 



rank(RA) = rank(RU A ^ 
I Sr Ua - Sj^j a I < e/Vl 
I (RU A ) + - (RU A ) T I < e/Vl 
(RA)+ = V A SX 1 (RU A ) + . 



rank(UA) = rank(A); 



e; 



(12) 
(13) 
(14) 
(15) 



7.2.1 Constructing The Probabilities 

We need to estimate u?. Observe that 



U7 



e JUU T e t = efAA+et. 



The costly part here is the computation of A + , and this is where the FJLT comes in. Let R be a 
FJLT as constructed in Lemma [28l Then, the previous discussion suggests that the projection by 
R should preserve the angles between at and A + e^. Thus, we expect 



AA + e ( « A(RA)+Re t . 
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Hence, we could approximate u 2 by eJA(RA) + Re£. The only thing we have to be careful about is 
that this estimate is not necessarily positive, so we will threshold it from below. The algorithm to 
compute the probabilities is given by 



1: Set e = \\J dl °l l m in Lemma M to obtain R. 

2: Compute X = (RA)+R. 

3: for t = 1 . . . , m do 

4: Compute the estimate wt = ajX^. 

5: Set wt = max{e 2 , u>t}. 

6: end for 

7: pt = Wt/ Wt (Normalizing) . 



For the algorithm described, we obtain the following bound on for the probabilities pt- 
Theorem 30. For m > |dlog 2 m, 

1 u 2 

Pi Z i 2 4- 

clog to a 



Remarks. The requirement on m is simply to ensure that e < |; it is benign and would be satisfied 
provided that m = uo(d\og 2 d). It is typically the case that to 3> d. We note that for application in 
our matrix algorithms, (3 = 0(1/ log 2 m); since the row complexity and computational complexity 
of all the algorithms is 0(1/ (3), this will bloat those by a factor of log 2 m. As we will see our 
algorithm constructs the approximation pt in o(md 2 ) under certain restritions on to; it is open 
whether the probabilities can be approximated better that 0(1/ log 2 to) in o(md 2 ) time 



Computational Complexity By the choice of e, from Lemma [28l we need r > c lo ™ m log ^ 
(where we assume to 3> d). From Lemma [28l computing RA takes 0(md(log to + log d log |)) time 
to compute, and the pseudo-inverse takes 0(rd 2 ) = 0( ^ m log ^); computing (RA) + R also takes 
0(TOd(logTO + logiilog ^)) time; finally, given X, computing p t takes 0(d) for a total 0(md). Thus, 
we have a total complexity of 

O (md\ogm + (md\ogd+ ^)log |j . 

For suitable to, i.e. logm, = o(d), this complexity is o(md 2 ). 



7.2.2 Bounding the Accuracy of the Probability Estimates 

We now prove Theorem [30j we show that the estimate pt constructed in the previous section does 
indeed produce probabilities which estimate u 2 /d up to logarithmic factors. We assume that in the 
construction of R for the algorithm, r is set according to Lemma [28l 

Lemma 31. For to > |<ilog 2 TO and a constant c > |, 

I u 4 1| 2 - ce|| Uf I < w t < I Ui || 2 + ce|| u t \\. 

Proof. For an arbitrary basis vector et, define e^ = AA + et which is the best approximation to et 
in the column space of A. Since AA + is a projection operator, e^e^ = etW T et = \\ e^ || 2 . Let 
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e f = A(RA) + Re t , and consider || eje t — ejejf \\. 

I e f T e t - e^e t * || ( = } || e t T U(RU)+Re t - e[UU T e ( || 

= I e[U((RU)+ - (RU) T + (RU) T )Re t - e[UU T e ( || 

< I e^U 1 1 (RU)+ - (RU) T 1 1 Re t || + || e^UU T R T Re t - e[UU T e ( || 
(6) e (l + e ) 

< ^Z-l l U T e t I + I e^UU T R T Re t - eJUU T e 4 1| 

< ^l^l + elU-e,! 
VI - e 

(d) 

< ce\\u t \\. 

(a) and (b) follow from Lemma [29] (c) from Lemma [28] and, (d) assumes that e < | which follows 
from the definition of e and the condition assumed on m. Notice that ejet is exactly the estimate 
W(, and ejejf = || Ut || 2 ; the lemma follows. ■ 

Define e' = ce. The next corollary is immediate from the lower bound in Lemma [31] and 
elementary calculus. 

Corollary 32. w t > -\c 2 e 2 . 

It is also useful to invert the bounds in Lemma [31] to get bounds for || || 2 in terms of Wf 

Corollary 33. w t + \c 2 e 2 - \ce^/c 2 e 2 + 4w t < \\ u t f < w t + \c 2 e 2 + \ce^c 2 e 2 + Aw t 

Proof. Using the quadratic formula, 

I u t \\ 2 + ce|| u t \\>w t =^> (I u t I + ±ce + ±Vc 2 e 2 + Aw t ){\ u t || + \ce - \^c 2 e 2 + Aw t ) > 0; 
I u t \\ 2 - ce|| u t \\<w t =>- (I u t I - ice + ±Vc 2 e 2 + 4w t ){\\ u t || - \ce - \yjc^t l + 4w t ) < 0. 

In the first inequality, the first term is positive, so the second must also be positive: 

I u t I > \^Jc 2 e 2 + Aw t - ^ce. 

In the second inequality, if the second term is positive, the first cannot be negative, so it must be 
that the second term is negative and the first is positive, and so 

I u t I > \ce - \\Jc 2 e 2 + 4w t ; 
I ^ I < ±ce + \\/c 2 e 2 + 4w t - 

Combining these three inequalities and squaring concludes the proof. ■ 

The appearence of || || in the bound of Lemma [311 is a serious problem, because it means that 
wt is not a relative approximation to || || , which would have been the ideal situation. This poses 
a serious problem if we are to normalize the wt by wt to obtain probabilities. If there are too 
many small uj's which appear large due to the error from the projection, then when we normalize, 
this inflates the importance of the small u^'s. This is why in the algorithm, we thresholded the w^s 
to obtain the wts. Since we will be normalizing by the sum, we will need bounds on the sums. 
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Lemma 34. For m > |<ilog 2 m and a constant c < 1 + y/7, 

m 

(1 - ce)d < w t < (1 + ce)d. 
t=i 

Proof. We use the fact that YT=i &t = E™i e f T U(RU)+Re t = trace(U(RU)+R): 

|trace(U(RU)+R) - d\ = |trace(U((RU) + - (RU) T )R) + trace(U(RU) T R) - d\ 

= |trace(((RU)+ - (RU) T )RU) + trace(U T R T RU) - d\ 

0) r- 

< Vd\ ((RU)+ - (RU) T )RU \\ F + |trace(U T R T RU) - d\ 



( c ) r- 

< Vd\\ (RU) + - (RU) T 1 1 RU || F + |trace(U T R T RU) - d\. 

(a) is from the cyclic property of the trace, (b) follows from the facts that || X || F = traceX T X, and 
the bound on the trace for rank d matrices obtained using the Cauchy-Schwarz inequality: 



trace(X) = £ <7;(X) < £ h(X)| < /^> 2 (X) = Vd\\ X | F . 

i i y i 

Finally, (c) follows because ||XY|| F < ||X||||Y|| F . Note that ||RU|| F = trace(U T R T RU). By 
Lemma ESI |a 2 (U T R T RU) - 1| < e; hence (1 - e)d < Ei^ 2 (U TRTRU ) < 0- + e K Thus ; usin g 
Lemma n 



3 



To complete the proof, note that the condition assumed on m makes e < | 

We are now ready to complete the proof of Theorem [30j We first recap the definition of wt ■ 



w t 



e 2 w t < e 2 ; 
w t w t > e 2 . 



Since wt > — |ce 2 , it follows that w t <w t + e 2 {l + f). Thus, using the definition of e, 



Y, w t < {l + c'e)d + me 2 (l + §) 
t=i 

log 2 m I 



= d^(l+ C ' e ) + ^(l + |)^ 

< cd log 2 m 

If wt < e 2 , then by Corollary [33] (replacing iot with e 2 in the upper bound), 



< ce 2 = cwt- 

If > e 2 , then, again, by Corollary [331 (replacing e 2 with iu t in the upper bound), 

K 2 | < w t + -c 2 w t (l + Vl + 4/c 2 ) 

< ciSt = cwj. 
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In both cases, wt > ^uf , which concludes the proof, because 

Li=i TO ( clog m a 

7.3 £2 Regression 

The probabilities depend on (the leverage scores) and et (the components of the residual error). 
Though the previous ideas would apply to constructing approximations to the leverage scores, 
it is not clear how one could get near relative error approximations to the components of the 
residual error. Indeed, sampling algorithms for £2 regression may have non-zero values for estimated 
residuals where the actual residual is zero. We would certainly need that when the actual residual 
converges to zero in some components, then so does the estimated residual in those components. 
Though we can get a 1 + e approximation to the sum of squared residuals, this appears to be 
of not much help to get the squared residuals themselves. An algorithm to construct a good 
approximation to the actual residuals would then mean that an efficient row-sampling algorithm 
for the £2 regression can also be obtained. As of yet, ithas been elusive. 

References 

Achlioptas, D. (2003). Database- friendly random projections: Johnson- lindenstrauss with binary coins. 
Journal of Computer and System Sciences, 66(4), 671-687. 

Achlioptas, D. and McSherry, F. (2005). On spectral learning of mixtures of distributions. In Proc. 18th 
Conference on Learning Theory (COLT), pages 458-469. 

Achlioptas. D. and McSherry, F. (2007). Fast computation of low-rank approximations. Journal of ACM , 
54(2), Article 10. 

Achlioptas, D., Fiat, A., Karlin, A., and McSherry, F. (2001). Web search via hub synthesis. In Proc. 42nd 
IEEE symposium on Foundations of Computer Science, pages 500-509. 

Ahlswede, R. and Winter, A. (2002). Strong converse for identification via quantum channels. Information 
Theory, IEEE Transactions on, 48(3), 569 -579. 

Ailon, N. and Chazelle, B. (2006). Approximate nearest neighbors and the fast Johnson-Lindcnstrauss 
transform. In Proc. 38th ACM Symposium on Theory of Computing, pages 557-563. 

Arriaga, R. I. and Vempala, S. (2006). An algorithmic theory of learning: Robust concepts and random 
projection. Mach. Learn., 63(2), 161-182. 

Azar, Y., Fiat, A., Karlin, A., McSherry, F., and Saia, J. (2001). Spectral analysis of data. In Proc. 33rd 
ACM symposium on Theory of computing, pages 619-626. 

Bernstein, S. (1924). On a modification of chebyshev's inequality and of the error formula of laplace. Ann. 
Sci. Inst. Sav. Ukraine, 4(5), 38-49. 

Berry, M. W., Dumais, S. T., and O'Brien, G. W. (1995). Using linear algebra for intelligent information 
retrieval. SIAM Rev., 37(4), 573-595. 

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of 
observations. Annals of Mathematical Statistics, 23(4), 493-507. 

Deshpande, A. and Vempala, S. (2006). Adaptive sampling and fast low-rank matrix approximation. In 
Proc. 10th RANDOM. 



24 



Deshpande, A., Rademacher, L., Vempala, S., and Wang, G. (2006). Matrix approximation and projective 
clustering via volume sampling. In Proc. 17th ACM-SIAM symposium on Discrete algorithms, pages 
1117-1126. 

Drincas, P., Kerenidis, I., and Raghavan, P. (2002). Competitive recommendation systems. In Proc. 34th 
ACM symposium on Theory of computing, pages 82-90. 

Drincas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. (2004). Clustering large graphs via the 
singular value decomposition. Machine Learning, 56(1-3), 9-33. 

Drincas, P., Kannan, R., and Mahoney, M. W. (2006a). Fast Monte Carlo algorithms for matrices i: Ap- 
proximating matrix multiplication. SIAM Journal on Computing, 36(1), 132-157. 

Drincas, P., Kannan, R., and Mahoney, M. W. (2006b). Fast Monte Carlo algorithms for matrices ii: 
Computing a low rank approximation to a matrix. SIAM Journal on Computing, 36(1), 158-183. 

Drineas, P., Kannan, R., and Mahoney, M. W. (2006c). Fast Monte Carlo algorithms for matrices hi: 
Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1), 184- 
206. 

Drincas, P., Mahoney, M. W., and Muthukrishnan, S. (2006d). Sampling algorithms for 12 regression and 
applications. In Proc. 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127-1136. 

Drineas, P., Mahoney, M. W., and Muthukrishnan, S. (2006e). Subspace sampling and relative-error matrix 
approximation: Column-row based methods. In Proc. 14-th ESA. 

Frieze, A., Kannan, R., and Vempala, S. (1998). Fast monte-carlo algorithms for finding low-rank approxi- 
mations. In Proc. 39th Annual Symposium on Foundations of Computer Science, pages 370-378. 

Golub, G. and Van Loan, C. (1983). Matrix Computations . Johns Hopkins University Press, Baltimore. 

Golub, G. and van Loan, C. (1996). Matrix computations. The Johns Hopkins University Press, London, 3 
edition. 

Gross, D., Liu, Y.-K., Steven Flammia, S. T., Becker, S., and Eiscrt, J. (2009). Quantum state tomography 
via compressed sensing, preprint available at: arXiv:0909.3304v2. 

Hocffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the Amer- 
ican Statistical Association, 58(301), 13-30. 

Johnson, W. and Lindcnstrauss, J. (1984). Extensions of lipschitz mappings into a hilbcrt space. Contem- 
porary Mathematics , 26, 189-206. 

Kannan, R., Salmasian, H., and Vempala, S. (2008). The spectral method for general mixture models. SIAM 
Journal of Computing, 38(3), 1141-1156. 

Klcinbcrg, J. (1999). Authoritative sources in a hypcrlinked environment. Journal of the ACM, 46(5), 
604-632. 

Magen, A. (2002). Dimensionality reductions that preserve volumes and distance to affine spaces, and their 
algorithmic applications. In Randomization and Approximation Techniques in Computer Science, volume 
2483 of Lecture Notes in Computer Science, pages 239-253. Springer, Berlin. 

Magen, A. and Zouzias, A. (2010). Low rank matrix-valued chcrnoff bounds and applications, submitted. 
http://arxiv.org/abs/1005.2724. 

McSherry, F. (2001). Spectral partitioning of random graphs. In Proc 42nd IEEE symposium on Foundations 
of Computer Science, pages 529-537. 



25 



Papadimitriou, C. H., Tamaki, H., Raghavan, P., and Vempala, S. (2000). Latent semantic indexing: A 
probabilistic analysis. Journal of Computer and System Sciences, 61(2), 217-235. 

Recht, B. (2009). A simpler approach to matrix completion, working paper. 

Rudelson, M. and Vershynin, R. (2007). Sampling from large matrices: An approach through geometric 
functional analysis. Journal of the ACM, 54(4), 19. 

Sarlos, T. (2006). Improved approximation algorithms for large matrices via random projections. In Proc. 
47th IEEE Symposium on Foundations of Computer Science, pages 143-152. 

Spiclman, D. A. and Srivastava, N. (2008). Graph sparsification by effective resistances. In STOC '08: 
Proc. 40th ACM Symposium on Theory of Computing , pages 563-568. 

Wigdcrson, A. and Xiao, D. (2008). Derandomizing the ahlswcdc-wintcr matrix-valued chernoff bound using 
pessimistic estimators, and applications. Theory of Computing , 4(3), 53-76. 



26 



